Abstract: This paper studies robust optimal control design
for uncertain nonlinear systems from the perspective of robust
adaptive dynamic programming (robust-ADP). The objective is
to fill a gap in the ADP literature, where dynamic
uncertainties or unmodeled dynamics are not addressed. A key
strategy is to integrate tools from modern nonlinear control
theory, such as robust redesign, backstepping, and the
nonlinear small-gain theorem, with the theory of ADP. The
proposed robust-ADP methodology can be viewed as a natural
extension of ADP to uncertain nonlinear systems. A practical
learning algorithm is developed and applied to a sensorimotor
control problem.

I. Introduction 

Reinforcement learning (RL) [30] is an important branch of
machine learning theory. It is concerned with how an agent
should modify its actions based on the reward from its reactive
unknown environment so as to achieve a long-term goal. In
1968, Werbos pointed out that the policy iteration technique
devised in [6] for dynamic programming can be employed to
perform RL [34]. Since then, many real-time RL methods for
finding optimal control policies online have emerged; they are
broadly called approximate/adaptive dynamic programming (ADP)
[35], [36] or neurodynamic programming. See [7], [22], [24],
among others, for some recently developed results.

In the past literature of ADP, it is commonly assumed that
the system order is known and the state variables are either
fully available or reconstructible from the output; see [22]
and references therein. However, in practice, the system order
may be unknown due to the presence of dynamic uncertainties
(or unmodeled dynamics), which arise in engineering
applications where an exact mathematical model of a physical
system is not easy to obtain. Of course, dynamic uncertainties
also make sense for mathematical modeling in other branches of
science such as biology and economics. This problem, often
formulated as robust control, cannot be viewed as a special
case of output feedback control, and the ADP methods developed
in the past literature may fail to guarantee not only
optimality but also stability of the closed-loop system when
dynamic uncertainty occurs. To fill this gap, we recently
developed a new theory of robust adaptive dynamic programming
(robust-ADP) [8], [10], [11], which can be viewed as a natural
extension of ADP to linear and partially linear systems with
dynamic uncertainties.

The primary objective of this paper is to study robust- 
ADP designs for genuinely nonlinear systems in the presence 
of dynamic uncertainties. We first decompose the open-loop 
system into two parts: The system model (ideal environment) 
with known system order and fully accessible state, and 
the dynamic uncertainty, with unknown system order and 
dynamics, interacting with the ideal environment. In order 
to handle the dynamic interaction between two systems, we 
then resort to the gain assignment idea [14], [15], [26]. More
specifically, we need to assign a suitable gain for the system 
model with disturbance in the sense of Sontag's input-to-state 
stability (ISS) [29]. The backstepping, robust redesign, and
small-gain techniques in modern nonlinear control theory are
incorporated into the robust-ADP theory, such that the system 
model is made ISS with an arbitrarily small gain. At last, the 
nonlinear small-gain theorem [IS] is applied to analyze the 
stability for the interconnected systems. 

Throughout this paper, vertical bars $|\cdot|$ represent the
Euclidean norm for vectors, or the induced matrix norm for
matrices. For any piecewise continuous function $u$, $\|u\|$
denotes $\sup\{|u(t)|,\ t \ge 0\}$. A function
$\gamma : \mathbb{R}_+ \to \mathbb{R}_+$ is said to be of class
$\mathcal{K}$ if it is continuous and strictly increasing with
$\gamma(0) = 0$. It is of class $\mathcal{K}_\infty$ if
additionally $\gamma(s) \to \infty$ as $s \to \infty$. A
function $\beta : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$
is of class $\mathcal{KL}$ if $\beta(\cdot, t)$ is of class
$\mathcal{K}$ for every fixed $t \ge 0$, and $\beta(s, t) \to 0$
as $t \to \infty$ for each fixed $s \ge 0$. The notation
$\gamma_1 > \gamma_2$ means $\gamma_1(s) > \gamma_2(s)$,
$\forall s > 0$.

II. Preliminaries 

In this section, let us review a policy iteration technique to
solve optimal control problems [27].
To begin with, consider the system

$$\dot{x} = f(x) + g(x)u \tag{1}$$

where $x \in \mathbb{R}^n$ is the system state, $u \in \mathbb{R}$
is the control input, and $f, g : \mathbb{R}^n \to \mathbb{R}^n$
are locally Lipschitz functions. For any initial condition
$x_0 \in \mathbb{R}^n$, the cost function associated with (1) is
defined as

$$J(x_0) = \int_0^\infty \left[Q(x) + r u^2\right] dt, \quad x(0) = x_0 \tag{2}$$

where $Q(\cdot)$ is a positive definite function, and $r > 0$ is
a constant. In addition, assume there exists an admissible
control policy $u = u_0(x)$ in the sense that, under this policy,
the system (1) is globally asymptotically stable and the cost
(2) is finite. By [20], the control policy that minimizes the
cost (2) can be solved from the following Hamilton-Jacobi-Bellman
(HJB) equation:

$$0 = \nabla V(x) f(x) + Q(x) - \frac{1}{4r}\left[\nabla V(x) g(x)\right]^2 \tag{3}$$

with the boundary condition $V(0) = 0$. Indeed, if the solution
$V^*(x)$ of (3) exists, the optimal control policy is given by

$$u^*(x) = -\frac{1}{2r}\, g(x)^T \nabla V^*(x)^T. \tag{4}$$

In general, the analytical solution of (3) is difficult to
obtain. However, if $V^*(x)$ exists, it can be approximated using
the policy iteration technique [27]:

1) Find an admissible control policy $u_0(x)$.

2) For any integer $i \ge 0$, solve for $V_i(x)$, with
$V_i(0) = 0$, using

$$0 = \nabla V_i(x)\left[f(x) + g(x)u_i(x)\right] + Q(x) + r u_i(x)^2. \tag{5}$$

3) Update the control policy using

$$u_{i+1}(x) = -\frac{1}{2r}\, g(x)^T \nabla V_i(x)^T. \tag{6}$$



Convergence of the policy iteration (5) and (6) is concluded
in the following theorem, which can be seen as a trivial
extension of Theorem 4 in [27].

Theorem 2.1: Consider $V_i(x)$ and $u_i(x)$ defined in (5) and
(6). Then, for all $i = 0, 1, \cdots$,

$$V_{i+1}(x) \le V_i(x), \quad \forall x \in \mathbb{R}^n \tag{7}$$

and $u_i(x)$ is admissible. In addition, if the solution $V^*(x)$
of (3) exists, then for each fixed $x$, $V_i(x)$ and $u_i(x)$
converge pointwise to $V^*(x)$ and $u^*(x)$, respectively.
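
As a small illustration of the iteration (5)-(6) and of the monotonicity stated in Theorem 2.1, the following sketch specializes to a scalar linear system $\dot{x} = ax + bu$ with $Q(x) = qx^2$. This special case, the parameter values, and the function names are assumptions made here for illustration; they are not taken from the paper. With $V_i(x) = p_i x^2$ and $u_i(x) = -k_i x$, both steps have closed forms, and the limit can be compared with the positive root of the HJB equation (3).

```python
import math

def policy_iteration(a, b, q, r, k0, iters=10):
    """Policy iteration (5)-(6) for dx/dt = a*x + b*u with Q(x) = q*x**2.

    With V_i(x) = p_i*x**2 and u_i(x) = -k_i*x, equation (5) reduces to
    0 = 2*p_i*(a - b*k_i) + q + r*k_i**2, and (6) to k_{i+1} = b*p_i/r.
    """
    if a - b * k0 >= 0:
        raise ValueError("u_0 = -k0*x is not stabilizing, hence not admissible")
    k, history = k0, []
    for _ in range(iters):
        p = (q + r * k * k) / (2.0 * (b * k - a))  # policy evaluation, eq. (5)
        history.append(p)
        k = b * p / r                              # policy improvement, eq. (6)
    return history, k

def riccati_solution(a, b, q, r):
    """Positive root of 2*a*p - (b*p)**2/r + q = 0, i.e. (3) with V = p*x**2."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)

# Hypothetical example values (not from the paper).
costs, gain = policy_iteration(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0)
p_star = riccati_solution(1.0, 1.0, 1.0, 1.0)
```

In this example the sequence `costs` is nonincreasing and approaches `p_star`, which mirrors (7) and the pointwise convergence claim for this simple case.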

III. Online Learning via Robust-ADP

In this section, we develop the robust-ADP methodology
for nonlinear systems of the following form:

$$\dot{w} = \Delta_w(w, x) \tag{8}$$
$$\dot{x} = f(x) + g(x)\left[u + \Delta(w, x)\right] \tag{9}$$

where $x \in \mathbb{R}^n$ is the measured component of the state
available for feedback control, $w \in \mathbb{R}^p$ is the
unmeasurable part of the state with unknown order $p$,
$u \in \mathbb{R}$ is the control input,
$\Delta_w : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^p$ and
$\Delta : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}$ are
unknown locally Lipschitz functions, and $f$ and $g$ are defined
the same as in (1) but are assumed to be unknown.

Our design objective is to find online a control policy
that stabilizes the system at the origin. Moreover, in the
absence of the dynamic uncertainty (i.e., $\Delta = 0$ and the
w-subsystem is absent), the control policy should reduce to the
optimal control policy that minimizes (2).



A. Online policy iteration 

The iterative technique introduced in Section 2 relies on the 
knowledge of f{x) and g{x). To remove this requirement, we 
develop a novel online policy iteration technique, which can 
be viewed as the nonlinear extension of Q. 

To begin with, notice that (9) can be rewritten as

$$\dot{x} = f(x) + g(x)u_i(x) + g(x)v_i \tag{10}$$

where $v_i = u + \Delta - u_i$. For each $i \ge 0$, the time
derivative of $V_i(x)$ along the solutions of (10) satisfies

$$\dot{V}_i(x) = -Q(x) - r u_i^2(x) - 2r u_{i+1}(x) v_i. \tag{11}$$

Integrating both sides of (11) on any time interval $[t, t+T]$,
it follows that

$$V_i(x(t+T)) - V_i(x(t)) = \int_t^{t+T} \left[-Q(x) - r u_i^2(x) - 2r u_{i+1}(x) v_i\right] d\tau. \tag{12}$$
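
For readers who want the intermediate step, the identity (11) uses only (5), (6), and (10). The short derivation below is a worked restatement under the scalar-input setting of the paper ($u \in \mathbb{R}$, so $\nabla V_i(x) g(x)$ is a scalar); it introduces no new assumptions.

```latex
\begin{align*}
\dot V_i(x)
  &= \nabla V_i(x)\,\bigl[f(x) + g(x)u_i(x)\bigr] + \nabla V_i(x)\,g(x)\,v_i
     && \text{along (10)} \\
  &= -Q(x) - r\,u_i^2(x) + \nabla V_i(x)\,g(x)\,v_i
     && \text{by (5)} \\
  &= -Q(x) - r\,u_i^2(x) - 2r\,u_{i+1}(x)\,v_i
     && \text{since } \nabla V_i(x)\,g(x) = -2r\,u_{i+1}(x) \text{ by (6).}
\end{align*}
```

The key point is that $f$ and $g$ no longer appear explicitly in the last line, which is what allows (12) to be evaluated from measured trajectories.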

Notice that if $u_i(x)$ is given, the unknown functions $V_i(x)$
and $u_{i+1}(x)$ can be approximated using (12). To be more
specific, for any given compact set $\Omega \subset \mathbb{R}^n$
containing the origin as an interior point, let
$\{\phi_j(x)\}_{j=1}^{\infty}$ be an infinite sequence of linearly
independent smooth basis functions on $\Omega$, where
$\phi_j(0) = 0$ for all $j = 1, 2, \cdots$. Then, for each
$i = 0, 1, \cdots$, the cost function and the control policy are
approximated by
$\hat{V}_i(x) = \sum_{j=1}^{N_1} \hat{c}_{i,j}\phi_j(x)$ and
$\hat{u}_{i+1}(x) = \sum_{j=1}^{N_2} \hat{w}_{i,j}\phi_j(x)$,
respectively, where $N_1 > 0$, $N_2 > 0$ are two sufficiently large
integers, and $\hat{c}_{i,j}$, $\hat{w}_{i,j}$ are constant weights
to be determined.

Replacing $V_i(x)$, $u_i(x)$, and $u_{i+1}(x)$ in (12) with their
approximations, we obtain

$$\sum_{j=1}^{N_1} \hat{c}_{i,j}\left[\phi_j(x(t_{k+1})) - \phi_j(x(t_k))\right]
= -\int_{t_k}^{t_{k+1}} 2r \sum_{j=1}^{N_2} \hat{w}_{i,j}\phi_j(x)\hat{v}_i\, dt
- \int_{t_k}^{t_{k+1}} \left[Q(x) + r\hat{u}_i^2(x)\right] dt + e_{i,k} \tag{13}$$

where $\hat{u}_0 = u_0$, $\hat{v}_i = u + \Delta - \hat{u}_i$, and
$\{t_k\}_{k=0}^{l}$ is a strictly increasing sequence with $l > 0$
a sufficiently large integer. Then, the weights $\hat{c}_{i,j}$ and
$\hat{w}_{i,j}$ can be solved in the sense of least squares (i.e.,
by minimizing $\sum_{k=0}^{l} e_{i,k}^2$).
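
The least-squares step in (13) can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's simulation: it uses a scalar linear plant $\dot{x} = ax + bu$ with no dynamic uncertainty ($\Delta = 0$), $Q(x) = qx^2$, a single basis function $x^2$ for $\hat{V}_i$ and a single basis function $x$ for $\hat{u}_{i+1}$ (a simplification of the shared basis $\{\phi_j\}$), and a hypothetical sum-of-sinusoids exploration noise. The plant parameters `a` and `b` are used only by the simulator; the learner sees only the recorded data.

```python
import numpy as np

def collect_data(a, b, k0, x0=1.0, dt=1e-3, steps=100, intervals=60):
    """Simulate dx/dt = a*x + b*u under u = -k0*x + e(t) and record, for each
    interval [t_k, t_{k+1}]: x^2(t_{k+1}) - x^2(t_k), int x^2 dt, int x*u dt."""
    def e(t):  # exploration noise (hypothetical choice)
        return 0.5 * (np.sin(3 * t) + np.sin(7 * t) + np.sin(11 * t))

    def rhs(t, s):  # s = [x, running int x^2, running int x*u]
        x = s[0]
        u = -k0 * x + e(t)
        return np.array([a * x + b * u, x * x, x * u])

    t, x, rows = 0.0, x0, []
    for _ in range(intervals):
        s = np.array([x, 0.0, 0.0])
        for _ in range(steps):  # classical RK4 on the augmented state
            k1 = rhs(t, s)
            k2 = rhs(t + dt / 2, s + dt / 2 * k1)
            k3 = rhs(t + dt / 2, s + dt / 2 * k2)
            k4 = rhs(t + dt, s + dt * k3)
            s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            t += dt
        rows.append([s[0] ** 2 - x ** 2, s[1], s[2]])
        x = s[0]
    return np.array(rows)

def online_policy_iteration(data, k0, q, r, iters=8):
    """Solve (13) by least squares for V_i = c*x^2 and u_{i+1} = w*x,
    reusing the same recorded data at every iteration."""
    dxx, ixx, ixu = data.T
    k = k0  # current policy u_i(x) = -k*x
    for _ in range(iters):
        # v_i = u - u_i = u + k*x, hence int x*v_i dt = ixu + k*ixx
        theta = np.column_stack([dxx, 2 * r * (ixu + k * ixx)])
        target = -(q + r * k * k) * ixx
        (c, w), *_ = np.linalg.lstsq(theta, target, rcond=None)
        k = -w  # improved policy gain
    return c, k

# Hypothetical example values (not from the paper).
data = collect_data(a=1.0, b=1.0, k0=3.0)
c_hat, k_hat = online_policy_iteration(data, k0=3.0, q=1.0, r=1.0)
```

For these values the model-based policy iteration converges to $p^* = 1 + \sqrt{2}$, and the data-driven estimates `c_hat` and `k_hat` should agree with it up to integration error. Without the exploration term the two regressor columns become collinear, which is the situation the rank condition on the collected data is meant to exclude.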

Now, starting from $u_0(x)$, two sequences
$\{\hat{V}_i(x)\}_{i=0}^{\infty}$ and
$\{\hat{u}_{i+1}(x)\}_{i=0}^{\infty}$ can be generated via the
online policy iteration technique (13). Next, we show the
convergence of the sequences to $V_i(x)$ and $u_{i+1}(x)$,
respectively.

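Assumption 3.1: There exist $l_0 > 0$ and $\delta > 0$, such that
for all $l \ge l_0$, we have

$$\frac{1}{l}\sum_{k=0}^{l} \theta_{i,k}^T \theta_{i,k} \ge \delta I_{N_1+N_2} \tag{14}$$

where

$$\theta_{i,k}^T = \begin{bmatrix}
\phi_1(x(t_{k+1})) - \phi_1(x(t_k)) \\
\phi_2(x(t_{k+1})) - \phi_2(x(t_k)) \\
\vdots \\
\phi_{N_1}(x(t_{k+1})) - \phi_{N_1}(x(t_k)) \\
2r\int_{t_k}^{t_{k+1}} \phi_1(x)\hat{v}_i(x)\, dt \\
2r\int_{t_k}^{t_{k+1}} \phi_2(x)\hat{v}_i(x)\, dt \\
\vdots \\
2r\int_{t_k}^{t_{k+1}} \phi_{N_2}(x)\hat{v}_i(x)\, dt
\end{bmatrix}.$$

Assumption 3.2: For all $t \ge 0$, we have $x(t) \in \Omega$.

Notice that Assumption 3.2 is not very restrictive and can
be satisfied if $\Omega$ is an invariant set for the x-subsystem.
This issue will be further elaborated in Section V.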

Theorem 3.1: Under Assumptions 3.1 and 3.2, for each
$i \ge 0$, we have

$$\lim_{N_1, N_2 \to \infty} \hat{V}_i(x) = V_i(x), \tag{15}$$
$$\lim_{N_1, N_2 \to \infty} \hat{u}_{i+1}(x) = u_{i+1}(x), \tag{16}$$

for all $x \in \Omega$.

Proof: See the Appendix. ■

Corollary 3.1: Under Assumptions 3.1 and 3.2, for any
arbitrary $\epsilon > 0$, there exist integers $i^* > 0$,
$N_1^* > 0$, and $N_2^* > 0$, such that

$$|\hat{V}_i(x) - V^*(x)| \le \epsilon, \quad \text{and} \quad |\hat{u}_{i+1}(x) - u^*(x)| \le \epsilon,$$

for all $x \in \Omega$, if $i > i^*$, $N_1 > N_1^*$, and
$N_2 > N_2^*$.

B. Robust redesign 

In the presence of the dynamic uncertainty, we redesign
the approximated optimal control policy so as to achieve
asymptotic stability. This method integrates optimal control
theory [20] with the gain assignment technique [15], [26].
To begin with, let us assume the following:

Assumption 3.3: There exists a function $\underline{\alpha}$ of
class $\mathcal{K}_\infty$, such that for $i = 0, 1, \cdots$,

$$\underline{\alpha}(|x|) \le V_i(x), \quad \forall x \in \mathbb{R}^n. \tag{17}$$

In addition, assume there exists a constant $\epsilon > 0$ such
that $Q(x) - \epsilon^2|x|^2$ is a positive definite function.

Notice that we can also find a class $\mathcal{K}_\infty$ function
$\bar{\alpha}$, such that for $i = 0, 1, \cdots$,

$$V_i(x) \le \bar{\alpha}(|x|), \quad \forall x \in \mathbb{R}^n. \tag{18}$$

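Assumption 3.4: Consider (8). There exist functions
$\underline{\lambda}, \bar{\lambda} \in \mathcal{K}_\infty$,
$\kappa_1, \kappa_2, \kappa_3 \in \mathcal{K}$, and positive
definite functions $W$ and $\kappa_4$, such that for all
$w \in \mathbb{R}^p$ and $x \in \mathbb{R}^n$, we have

$$\underline{\lambda}(|w|) \le W(w) \le \bar{\lambda}(|w|), \tag{19}$$
$$|\Delta(w, x)| \le \max\{\kappa_1(|w|), \kappa_2(|x|)\}, \tag{20}$$

together with the following implication:

$$W(w) \ge \kappa_3(|x|) \Rightarrow \nabla W(w)\Delta_w(w, x) \le -\kappa_4(|w|). \tag{21}$$

Assumption 3.4 implies that the w-system (8) is input-to-state
stable (ISS) [29] when $x$ is considered as the input, i.e.,

$$|w(t)| \le \beta_w(|w(0)|, t) + \gamma_w(\|x\|) \tag{22}$$

where $\beta_w$ is of class $\mathcal{KL}$ and $\gamma_w$ is of
class $\mathcal{K}$.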
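Now, consider the following type of control policy

$$u_{ro}(x) = \left[1 + \frac{r}{2}\rho^2(|x|^2)\right]\hat{u}_{i^*+1}(x) \tag{23}$$

where $i^* > 0$ is a sufficiently large integer as defined in
Corollary 3.1, and $\rho$ is a smooth, nondecreasing function with
$\rho(s) > 0$ for all $s > 0$. Notice that $u_{ro}$ can be viewed
as a robust redesign of the approximated optimal control law
$\hat{u}_{i^*+1}$. As in [14], let us define a class
$\mathcal{K}_\infty$ function $\gamma$ by

$$\gamma(s) = \frac{1}{2}\epsilon\rho(s^2)s, \quad \forall s \ge 0. \tag{24}$$

In addition, define

$$e_{ro}(x) = \frac{r}{2}\rho^2(|x|^2)\left[\hat{u}_{i^*+1}(x) - u_{i^*+1}(x)\right] + \hat{u}_{i^*+1}(x) - u_{i^*}(x). \tag{25}$$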

Theorem 3.2: Under Assumptions 3.3 and 3.4, suppose

$$\gamma > \max\{\kappa_2,\ \kappa_1 \circ \underline{\lambda}^{-1} \circ \kappa_3 \circ \underline{\alpha}^{-1} \circ \bar{\alpha}\}, \tag{26}$$

and the following implication holds for some constant $d > 0$:

$$0 < V_{i^*}(x) < d \Rightarrow |e_{ro}(x)| < \gamma(|x|). \tag{27}$$

Then, the closed-loop system composed of (8), (9), and (23)
is asymptotically stable at the origin. In addition, there exists
$\sigma \in \mathcal{K}_\infty$, such that
$\Omega_{i^*} = \{(w, x) : \max[\sigma(V_{i^*}(x)), W(w)] < \sigma(d)\}$
is an estimate of the region of attraction of the closed-loop
system.

Proof: See the Appendix. ■ 
Remark 3.1: In the absence of the dynamic uncertainty
(i.e., $\Delta = 0$ and the w-system is absent), the control
policy (23) can be replaced by $\hat{u}_{i^*+1}(x)$, which is an
approximation of the optimal control policy $u^*(x)$ that
minimizes the following cost function

$$J(x_0) = \int_0^\infty \left[Q(x) + r u^2\right] dt, \quad x(0) = x_0. \tag{28}$$
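
The redesign step can be written as a thin wrapper around a learned policy. The sketch below assumes the multiplicative form $u_{ro}(x) = [1 + \frac{r}{2}\rho^2(|x|^2)]\,\hat{u}_{i^*+1}(x)$ for (23) and $\gamma(s) = \frac{1}{2}\epsilon\rho(s^2)s$ for (24); the function names, the linear example policy, and the particular $\rho$ are hypothetical choices for illustration only.

```python
import numpy as np

def robust_redesign(u_hat, rho, r):
    """Wrap a learned policy u_hat into u_ro(x) = [1 + (r/2)*rho(|x|^2)^2] * u_hat(x)."""
    def u_ro(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        s = float(x @ x)  # |x|^2
        return (1.0 + 0.5 * r * rho(s) ** 2) * u_hat(x)
    return u_ro

def gain_function(rho, eps):
    """Return gamma(s) = (1/2)*eps*rho(s^2)*s, the gain used in condition (26)."""
    def gamma(s):
        return 0.5 * eps * rho(s * s) * s
    return gamma

# Hypothetical learned policy and redesign function (not from the paper).
u_hat = lambda x: -2.4 * x[0] - 1.0 * x[1]
rho = lambda s: 1.0 + s  # smooth, nondecreasing, positive for s > 0

u_ro = robust_redesign(u_hat, rho, r=1.0)
gamma = gain_function(rho, eps=0.5)
u_nominal = robust_redesign(u_hat, lambda s: 0.0, r=1.0)  # rho = 0 limit of Remark 3.1
```

Choosing a larger $\rho$ raises the feedback gain and the value of $\gamma$, which is how the gain condition (26) can be met; setting $\rho$ to zero returns the unmodified learned policy, as in Remark 3.1.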

IV. Robust-ADP with Unmatched Dynamic Uncertainty

In this section, we extend the robust-ADP methodology to
nonlinear systems with unmatched dynamic uncertainties. To
begin with, consider the system:

$$\dot{w} = \Delta_w(w, x) \tag{29}$$
$$\dot{x} = f(x) + g(x)\left[z + \Delta(w, x)\right] \tag{30}$$
$$\dot{z} = f_1(x, z) + u + \Delta_1(w, x, z) \tag{31}$$

where $[x^T, z]^T \in \mathbb{R}^n \times \mathbb{R}$ is the measured
component of the state available for feedback control; $w$, $u$,
$\Delta_w$, $f$, $g$, and $\Delta$ are defined in the same way as
in (8)-(9); $f_1 : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$
and $\Delta_1 : \mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$
are locally Lipschitz functions and are assumed to be unknown.

Assumption 4.1: There exist class $\mathcal{K}$ functions
$\kappa_5, \kappa_6, \kappa_7$, such that the following inequality
holds:

$$|\Delta_1(w, x, z)| \le \max\{\kappa_5(|w|),\ \kappa_6(|x|),\ \kappa_7(|z|)\}. \tag{32}$$

A. Online learning 

Let us define a virtual control policy $\xi = u_{ro}$, where
$u_{ro}$ is defined in (23). Then, a state transformation can be
performed as $\zeta = z - \xi$. Along the trajectories of
(30)-(31), it follows that

$$\dot{\zeta} = \bar{f}_1(x, z) + u + \Delta_1 - g_1(x)\Delta \tag{33}$$

where
$\bar{f}_1(x, z) = f_1(x, z) - \frac{\partial \xi}{\partial x}f(x) - \frac{\partial \xi}{\partial x}g(x)z$
and $g_1(x) = \frac{\partial \xi}{\partial x}g(x)$ are two unknown
functions that can be approximated by
$\hat{f}_1(x, z) = \sum_{j=1}^{N_3} \hat{w}_{f,j}\psi_j(x, z)$ and
$\hat{g}_1(x) = \sum_{j=0}^{N_4} \hat{w}_{g,j}\phi_j(x)$,
respectively, where $\{\psi_j(x, z)\}_{j=1}^{\infty}$ is a
sequence of linearly independent basis functions on some
compact set $\Omega_1 \subset \mathbb{R}^{n+1}$ containing the
origin in its interior, $\phi_0(x) \equiv 1$, and
$\hat{w}_{f,j}$ and $\hat{w}_{g,j}$ are constant weights to be
trained. As in the matched case, $\Omega_1$ is selected to be an
invariant set for the system (30) and (31).

1) Phase-one learning: To approximate the virtual control
input $\xi$ for the x-subsystem, the same procedure as in (13)
can be applied, with $\hat{v}_i = z + \Delta - \hat{u}_i$.

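2) Phase-two learning: To approximate the unknown functions
$\bar{f}_1$ and $g_1$, the constant weights can be solved, in the
sense of least squares, from

$$\frac{1}{2}\zeta^2(t'_{k+1}) - \frac{1}{2}\zeta^2(t'_k)
= \int_{t'_k}^{t'_{k+1}} \left[\sum_{j=1}^{N_3} \hat{w}_{f,j}\psi_j(x, z) - \sum_{j=0}^{N_4} \hat{w}_{g,j}\phi_j(x)\Delta\right]\zeta\, dt
+ \int_{t'_k}^{t'_{k+1}} (u + \Delta_1)\zeta\, dt + \bar{e}_k \tag{34}$$

where $\{t'_k\}_{k=0}^{l}$ is a strictly increasing positive
constant sequence with $l > 0$ a sufficiently large integer, and
$\bar{e}_k$ denotes the approximation error. Similarly to the
previous section, let us introduce the following assumption: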

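Assumption 4.2: There exist $l_1 > 0$ and $\delta_1 > 0$, such that
for all $l \ge l_1$, we have

$$\frac{1}{l}\sum_{k=0}^{l} \bar{\theta}_k^T \bar{\theta}_k \ge \delta_1 I_{N_3+N_4} \tag{35}$$

where $\bar{\theta}_k^T$ is the column vector that collects the
regressors of (34), namely
$\int_{t'_k}^{t'_{k+1}} \psi_1(x, z)\zeta\, dt, \cdots, \int_{t'_k}^{t'_{k+1}} \psi_{N_3}(x, z)\zeta\, dt$,
followed by the corresponding integrals of
$\phi_j(x)\Delta\zeta$.

Theorem 4.1: Consider $(x(0), z(0)) \in \Omega_1$. Then, under
Assumption 4.2, we have

$$\lim_{N_3, N_4 \to \infty} \hat{f}_1(x, z) = \bar{f}_1(x, z), \tag{36}$$
$$\lim_{N_3, N_4 \to \infty} \hat{g}_1(x) = g_1(x), \quad \forall (x, z) \in \Omega_1. \tag{37}$$

Theorem 4.1 can be proved following the same idea as in the
proof of Theorem 3.1, and the proof is omitted here for want of
space.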

B. Robust redesign 

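Next, we study robust stabilization of the system (29)-(31).
To this end, let $\kappa_8$ be a function of class $\mathcal{K}$
such that

$$\kappa_8(|x|) \ge |\xi(x)|, \quad \forall x \in \mathbb{R}^n. \tag{38}$$

Then, Assumption 4.1 implies

$$\begin{aligned}
|\Delta_1| &\le \max\{\kappa_5(|w|),\ \kappa_6(|x|),\ \kappa_7(|z|)\} \\
&\le \max\{\kappa_5(|w|),\ \kappa_6(|x|),\ \kappa_7(|\zeta| + \kappa_8(|x|))\} \\
&\le \max\{\kappa_5(|w|),\ \kappa_9(|X_1|)\}
\end{aligned}$$

where
$\kappa_9(s) = \max\{\kappa_6,\ \kappa_7 \circ \kappa_8 \circ (2s),\ \kappa_7 \circ (2s)\}$,
$\forall s \ge 0$. In addition, we denote
$\bar{\kappa}_1 = \max\{\kappa_1, \kappa_5\}$,
$\bar{\kappa}_2 = \max\{\kappa_2, \kappa_9\}$,
$\gamma_1(s) = \frac{1}{2}\epsilon\rho(\frac{1}{2}s^2)s$, and

$$U_{i^*}(X_1) = V_{i^*}(x) + \frac{1}{2}\zeta^2. \tag{39}$$

Notice that, under Assumptions 3.3 and 3.4, there exist
$\underline{\alpha}_1, \bar{\alpha}_1 \in \mathcal{K}_\infty$, such
that
$\underline{\alpha}_1(|X_1|) \le U_{i^*}(X_1) \le \bar{\alpha}_1(|X_1|)$.
The control policy can be approximated by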



Urol = -fi{x,z) + 2rUi.+i{x) 
g^ix)pli\X,\')C 

pI{\x,\')C 6V(C^)C 

4 2p2(|a;|2) 

where $X_1 = [x^T, \zeta]^T$ and $\rho_1(s) = 2\rho(\frac{1}{2}s)$.
Next, define the approximation error as

eroi(^i) = -fi{x,z) + fi{x,z) 

+2r [ui^+i{x) - Ui*+i{x)] 
[gfix)-gf{x)] p?(|Xi|2)C 



(40) 



(41) 



Then, conditions for asymptotic stability are summarized in 
the following Theorem: 

Theorem 4.2: Under Assumptions 3.3, 3.4, and 4.1, if

$$\gamma_1 > \max\{\bar{\kappa}_2,\ \bar{\kappa}_1 \circ \underline{\lambda}^{-1} \circ \kappa_3 \circ \underline{\alpha}_1^{-1} \circ \bar{\alpha}_1\}, \tag{42}$$

and if the following implication holds for some constant
$d_1 > 0$:

$$0 < U_{i^*}(X_1) < d_1 \Rightarrow \max\{|e_{ro1}(X_1)|,\ |e_{ro}(x)|\} < \gamma_1(|X_1|),$$

then the closed-loop system comprised of (29)-(31) and (40)
is asymptotically stable at the origin. In addition, there exists
$\sigma_1 \in \mathcal{K}_\infty$, such that

$$\{(w, X_1) : \max[\sigma_1(U_{i^*}(X_1)),\ W(w)] < \sigma_1(d_1)\}$$

is an estimate of the region of attraction.

Proof: See the Appendix. ■ 
Remark 4.1: In the absence of the dynamic uncertainty
(i.e., $\Delta = 0$, $\Delta_1 = 0$, and the w-system is absent),
the smooth functions $\rho$ and $\rho_1$ can all be replaced by 0,
and the system dynamics becomes



Gi 



Xi = Fi(Xi)+GiUoi 
f{x)+g{x)^ + g{x)£. 

As a result, it can be concluded that the 
control policy ii = u^i is an approximate optimal control 
policy with respect to the cost function 



where Fi{Xi) = 
and Uoi = -e^C^. 



(43) 


1 



Ji(Xi(0)) = 



dt 



(44) 



with Xi(0) = [a;o ,^0 - Ui*{xo)Y and Qi (x, C) 
l^[VV.,[x)g{x)f+^^e- 

V. Implementation Issues 

In this section, we study a few implementation issues of
the robust-ADP based online learning methodology and give a
practical algorithm. Due to space limitations, we mainly
focus on systems with matched dynamic uncertainties.
These results can be easily extended to the unmatched case.

A. The compact set for approximation 

Assumption 5.1: The closed-loop system composed of (8), (9),
and

$$u = u_0(x) + e \tag{45}$$

is ISS when $e$, the exploration noise, is considered to be the
input.

The reason for imposing Assumption 5.1 is twofold. First,
as in many other policy iteration based ADP algorithms,
an initial admissible control policy is required. In this paper
we further assume that the initial control policy is stabilizing
in the presence of dynamic uncertainties. Such an assumption
is feasible and realistic by means of the designs in [14], [26].
Second, by adding the exploration noise, we are able to satisfy
Assumptions 3.1 and 4.2 and at the same time keep the system
solutions bounded.

Under Assumption 5.1, we can find a compact set $\Omega_0$ which
is an invariant set of the closed-loop system composed of (8),
(9), and $u = u_0(x)$. In addition, we can also let $\Omega_0$
contain $\Omega_{i^*}$ as its subset. Then, the compact set for
approximation can be selected as
$\Omega = \{x \mid \exists w \text{ s.t. } (x, w) \in \Omega_0\}$.

B. Two-loop optimization scheme 

In general, it may be difficult to determine the number
of basis functions to be used for approximation. In this paper
we propose a two-loop online optimization scheme as shown
in Fig. 1. In the inner loop, the least-squares method is used
to train the weights. If the residual sum of errors is greater
than a given threshold $\epsilon > 0$, then in the outer loop the
number of basis functions is increased and online data are
recollected to solve the minimization problem, until a
sufficiently small residual error is obtained.
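
The two-loop scheme can be sketched generically as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the target is a synthetic function sampled by a hypothetical `collect` callback (standing in for recollecting online data), the basis is the monomials $x, x^2, \ldots$ (which vanish at the origin, as required of $\phi_j$), and the names and threshold are placeholders.

```python
import numpy as np

def two_loop_fit(collect, tol=1e-6, n_start=1, n_max=12):
    """Outer loop: grow the number of basis functions and recollect data.
    Inner loop: least-squares training of the weights on the current basis."""
    n = n_start
    while n <= n_max:
        x, y = collect()  # recollect data for the current basis size
        phi = np.column_stack([x ** j for j in range(1, n + 1)])  # phi_j(0) = 0
        weights, *_ = np.linalg.lstsq(phi, y, rcond=None)         # inner loop
        residual = float(np.sum((phi @ weights - y) ** 2))
        if residual <= tol:
            return n, weights, residual
        n += 1  # residual too large: add one more basis function
    raise RuntimeError("residual threshold not met with n_max basis functions")

# Hypothetical data source: samples of sin(x) on [-1, 1] (not from the paper).
rng = np.random.default_rng(0)

def collect():
    x = rng.uniform(-1.0, 1.0, size=50)
    return x, np.sin(x)

n_basis, weights, residual = two_loop_fit(collect)
```

The upper bound `n_max` is a practical safeguard added here; the scheme in the text simply repeats until the residual is small enough.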



[Figure 1: flowchart with an inner loop (least-squares training of the weights) and an outer loop (increase $N_1$ and $N_2$, then recollect online data).]

Fig. 1. Two-loop online optimization scheme

C. Robust-ADP algorithm 

The robust-ADP algorithm is given in Algorithm 1.

Algorithm 1 Robust-ADP Algorithm

1. Let $(w(0), x(0)) \in \Omega_0$, employ the initial control
policy (45), and collect the system state and input information.

2. Apply the online policy iteration using (13), and redesign
the control policy using (23).

3. Terminate the exploration noise $e$.

4. If $(w(t), x(t)) \in \Omega_{i^*}$, apply the approximate robust
optimal control policy (23).



VI. Application to a Single-Joint Human Arm Movement Control Problem

In this section, we apply the proposed online learning
strategy to study a sensorimotor control problem. A linear
version of this problem has been studied in [12].

Consider a single-joint arm movement as shown in Fig. 2,
where the position of the elbow is fixed. The dynamic model
is shown below [28].

$$I\ddot{\theta} = -mgl\cos(\theta) + n + T_m \tag{46}$$

where $m$ is the mass of the segment, $I$ is the inertia, $g$ is
the gravitational constant, $l$ is the distance of the center of
mass from the joint, $\theta$ is the joint angular position, $T_m$
is the input to the muscle from the motoneurons, and $n$ denotes
the inputs from the neural integrator, which can be modeled by a
low-pass filter with a time constant $\tau_N$ as follows:

$$\dot{n} = -\frac{n}{\tau_N} + T_m. \tag{47}$$

Let us define $x_1 = \theta - \theta_0$, $x_2 = \dot{\theta}$,
$w = n - \frac{\tau_N mgl\cos(\theta_0)}{\tau_N + 1} - Ix_2$, and
$u = T_m - \frac{mgl\cos(\theta_0)}{\tau_N + 1}$, where $\theta_0$
is the desired end point angular position. Then, the system can be
converted to

$$\dot{w} = -\frac{1 + \tau_N}{\tau_N}(w + Ix_2) \tag{48}$$
$$\qquad - 2mgl\sin\left(\frac{x_1}{2}\right)\sin\left(\frac{x_1}{2} + \theta_0\right) \tag{49}$$
$$\dot{x}_1 = x_2 \tag{50}$$
$$\dot{x}_2 = \frac{2mgl}{I}\sin\left(\frac{x_1}{2}\right)\sin\left(\frac{x_1}{2} + \theta_0\right) + \frac{1}{I}(u + Ix_2 + w). \tag{51}$$
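
The change of coordinates leading to (48)-(51) can be checked numerically. The sketch below assumes the model $I\ddot{\theta} = -mgl\cos\theta + n + T_m$ with filter $\dot{n} = -n/\tau_N + T_m$, and the definitions $x_1 = \theta - \theta_0$, $x_2 = \dot{\theta}$, $w = n - \tau_N mgl\cos\theta_0/(\tau_N+1) - Ix_2$, $u = T_m - mgl\cos\theta_0/(\tau_N+1)$. The values of `theta0` and `tau_N` are hypothetical placeholders, and the function names are illustrative.

```python
import math

# m, l, g, I follow the simulation section; theta0 and tau_N are hypothetical.
P = dict(m=1.65, l=0.179, g=9.81, I=0.0779, theta0=0.7, tau_N=0.1)

def original_rates(theta, n, T_m, p=P):
    """Angular acceleration from (46) and filter rate from (47)."""
    ddtheta = (-p["m"] * p["g"] * p["l"] * math.cos(theta) + n + T_m) / p["I"]
    dn = -n / p["tau_N"] + T_m
    return ddtheta, dn

def to_transformed(theta, dtheta, n, T_m, p=P):
    """Map (theta, dtheta, n, T_m) to the coordinates (w, x1, x2, u)."""
    c = p["m"] * p["g"] * p["l"] * math.cos(p["theta0"]) / (p["tau_N"] + 1.0)
    x1, x2 = theta - p["theta0"], dtheta
    w = n - p["tau_N"] * c - p["I"] * x2
    u = T_m - c
    return w, x1, x2, u

def transformed_rates(w, x1, x2, u, p=P):
    """Right-hand sides of (48)-(51)."""
    s = 2.0 * p["m"] * p["g"] * p["l"] * math.sin(x1 / 2) * math.sin(x1 / 2 + p["theta0"])
    dw = -(1.0 + p["tau_N"]) / p["tau_N"] * (w + p["I"] * x2) - s
    dx1 = x2
    dx2 = s / p["I"] + (u + p["I"] * x2 + w) / p["I"]
    return dw, dx1, dx2

# Compare both descriptions at an arbitrary test point.
theta, dtheta, n, T_m = 0.3, -0.8, 1.1, 0.4
ddtheta, dn = original_rates(theta, n, T_m)
dw, dx1, dx2 = transformed_rates(*to_transformed(theta, dtheta, n, T_m))
```

Since $w = n - \text{const} - Ix_2$, the transformed rates must satisfy $\dot{x}_2 = \ddot{\theta}$ and $\dot{w} = \dot{n} - I\ddot{\theta}$, and the origin $(w, x_1, x_2) = 0$ with $u = 0$ must be an equilibrium; both properties can be checked directly with the functions above.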



[Figure 2: schematic of the arm with labels shoulder, elbow, and hand.]

Fig. 2. Single-joint arm movement control problem.

Fig. 3. Comparison of the approximated cost functions.



To apply the proposed robust-ADP method, the basis functions
we used are polynomials with degrees less than or equal
to five. The invariant set is chosen to contain the region
$\{(w, x_1, x_2) : |w| \le 1,\ |x_1| \le 0.8,\ |x_2| \le 3.5\}$. Only for
simulation purpose, we set 9q = j, m ^ 1.65, I = 0.179, 
$g = 9.81$, $I = 0.0779$. An initial control policy is set to
be $u_0 = -0.5x_1 - 0.5x_2$. The initial condition is set to be
w{0) = 1, a;i(0) — — J, and 0:2(0) = 0. The optimal cost is 
specified as $J = \int_0^\infty (100x_1^2 + x_2^2 + u^2)\, dt$.

In this simulation, convergence is attained after 10
iterations. It can be seen from Fig. 3 that the approximated
cost function $\hat{V}_{10}(x)$ is remarkably reduced compared with
the initial approximated cost $\hat{V}_0(x)$. Also, in Fig. 4 we
compare the speed curves under the initial control policy, the
policy after two iterations, and the policy after 10 iterations.
Clearly, after enough iteration steps, the speed profile becomes
a bell-shaped curve, which is consistent with experimental
observations (see, for example, [3]).

[Figure 4: speed profiles versus time (sec) over 0 to 2 s; the legend includes the initial performance and the performance after iterations.]

Fig. 4. Comparison of the speed profiles.

VII. Conclusions

In this paper, computational robust optimal controller design
has been studied for nonlinear systems with dynamic
uncertainties. Both the matched and the unmatched cases are
studied. We have presented for the first time a recursive,
online, adaptive optimal controller design when dynamic
uncertainties, characterized by input-to-state stable systems
with unknown order and states/dynamics, are taken into
consideration. We have achieved this goal by integrating
approximate/adaptive dynamic programming (ADP) theory with
several tools recently developed within the nonlinear control
community. A systematic robust-ADP based online learning
algorithm has been developed. Rigorous stability analysis based
on Lyapunov and small-gain techniques has been carried out. The
effectiveness of the proposed methodology has been validated
by its application to a single-joint arm movement control
problem.

Appendix 

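Proof of Theorem 3.1

To begin with, given $\hat{u}_i$, let $\tilde{V}_i(x)$ be the
solution of the following equation with $\tilde{V}_i(0) = 0$:

$$\nabla\tilde{V}_i(x)\left(f(x) + g(x)\hat{u}_i(x)\right) + Q(x) + r\hat{u}_i^2(x) = 0 \tag{52}$$

and denote
$\tilde{u}_{i+1}(x) = -\frac{1}{2r}g(x)^T\nabla\tilde{V}_i(x)^T$.

Lemma A.1: For each $i \ge 0$, we have
$\lim_{N_1, N_2 \to \infty} \hat{V}_i(x) = \tilde{V}_i(x)$ and
$\lim_{N_1, N_2 \to \infty} \hat{u}_{i+1}(x) = \tilde{u}_{i+1}(x)$,
$\forall x \in \Omega$.

Proof: By definition,

$$\tilde{V}_i(x(t_{k+1})) - \tilde{V}_i(x(t_k)) = -\int_{t_k}^{t_{k+1}} \left[Q(x) + r\hat{u}_i^2(x) + 2r\tilde{u}_{i+1}(x)\hat{v}_i(x)\right] dt. \tag{53}$$

Let $\tilde{c}_{i,j}$ and $\tilde{w}_{i,j}$ be the constant weights
such that
$\tilde{V}_i(x) = \sum_{j=1}^{\infty} \tilde{c}_{i,j}\phi_j(x)$ and
$\tilde{u}_{i+1}(x) = \sum_{j=1}^{\infty} \tilde{w}_{i,j}\phi_j(x)$.
Then, by (13) and (53), we have
$e_{i,k} = \theta_{i,k}\bar{W}_i + \xi_{i,k}$, where

$$\bar{W}_i = \begin{bmatrix} \hat{c}_{i,1} & \cdots & \hat{c}_{i,N_1} & \hat{w}_{i,1} & \cdots & \hat{w}_{i,N_2} \end{bmatrix}^T - \begin{bmatrix} \tilde{c}_{i,1} & \cdots & \tilde{c}_{i,N_1} & \tilde{w}_{i,1} & \cdots & \tilde{w}_{i,N_2} \end{bmatrix}^T,$$

$$\xi_{i,k} = -\sum_{j=N_1+1}^{\infty} \tilde{c}_{i,j}\left[\phi_j(x(t_{k+1})) - \phi_j(x(t_k))\right] - \sum_{j=N_2+1}^{\infty} \tilde{w}_{i,j}\int_{t_k}^{t_{k+1}} 2r\phi_j(x)\hat{v}_i\, dt.$$

Since the weights are found using the least-squares method,
we have

$$\sum_{k=0}^{l} e_{i,k}^2 \le \sum_{k=0}^{l} \xi_{i,k}^2.$$

Also notice that

$$\sum_{k=0}^{l} \bar{W}_i^T\theta_{i,k}^T\theta_{i,k}\bar{W}_i = \sum_{k=0}^{l} (e_{i,k} - \xi_{i,k})^2.$$

Then, under Assumption 3.1, it follows that

$$|\bar{W}_i|^2 \le \frac{4}{l\delta}\sum_{k=0}^{l} \xi_{i,k}^2 \le \frac{4}{\delta}\max_{0 \le k \le l} \xi_{i,k}^2.$$

Therefore, given any arbitrary $\epsilon > 0$, we can find
$N_{10} > 0$ and $N_{20} > 0$, such that when $N_1 > N_{10}$ and
$N_2 > N_{20}$, we have

$$|\hat{V}_i(x) - \tilde{V}_i(x)| \tag{54}$$
$$\le \sum_{j=1}^{N_1} |\tilde{c}_{i,j} - \hat{c}_{i,j}||\phi_j(x)| + \sum_{j=N_1+1}^{\infty} |\tilde{c}_{i,j}\phi_j(x)| \tag{55}$$
$$\le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \tag{56}$$

Similarly, $|\hat{u}_{i+1}(x) - \tilde{u}_{i+1}(x)| \le \epsilon$.
The proof is complete. ■

We now prove Theorem 3.1 by induction:

1) If $i = 0$, we have $\tilde{V}_0(x) = V_0(x)$ and
$\tilde{u}_1(x) = u_1(x)$. Hence, the convergence can be proved
directly by Lemma A.1.

2) Suppose for some $i > 0$, we have
$\lim_{N_1, N_2 \to \infty} \hat{V}_{i-1}(x) = V_{i-1}(x)$ and
$\lim_{N_1, N_2 \to \infty} \hat{u}_i(x) = u_i(x)$,
$\forall x \in \Omega$. By definition, we have

$$\begin{aligned}
|V_i(x) - \tilde{V}_i(x)| \le{}& r\left|\int_0^\infty \left[\hat{u}_i(x)^2 - u_i(x)^2\right] dt\right| \\
&+ 2r\left|\int_0^\infty u_{i+1}(x)g(x)\left[\hat{u}_i(x) - u_i(x)\right] dt\right| \\
&+ 2r\left|\int_0^\infty \left[\tilde{u}_{i+1}(x) - u_{i+1}(x)\right]g(x)\hat{v}_i\, dt\right|, \quad \forall x \in \Omega.
\end{aligned}$$

By the induction assumptions, we know

$$\lim_{N_1, N_2 \to \infty} \int_0^\infty \left[\hat{u}_i(x)^2 - u_i(x)^2\right] dt = 0 \tag{57}$$
$$\lim_{N_1, N_2 \to \infty} \int_0^\infty u_{i+1}(x)g(x)\left[\hat{u}_i(x) - u_i(x)\right] dt = 0. \tag{58}$$

Also, by Assumption 3.1, we conclude

$$\lim_{N_1, N_2 \to \infty} |\tilde{u}_{i+1}(x) - u_{i+1}(x)| = 0 \tag{59}$$

and

$$\lim_{N_1, N_2 \to \infty} |V_i(x) - \tilde{V}_i(x)| = 0. \tag{60}$$

Finally, since

$$|\hat{V}_i(x) - V_i(x)| \le |\hat{V}_i(x) - \tilde{V}_i(x)| + |\tilde{V}_i(x) - V_i(x)|$$

and by the induction assumption, we have

$$\lim_{N_1, N_2 \to \infty} |\hat{V}_i(x) - V_i(x)| = 0. \tag{61}$$

The proof is thus complete.

Proof of Theorem 3.2

Define

$$\bar{e}_{ro}(x) = \begin{cases} e_{ro}(x), & V_{i^*}(x) \le d \\ 0, & V_{i^*}(x) > d \end{cases} \tag{62}$$

and

$$u(x) = u_{i^*}(x) + \frac{r}{2}\rho^2(|x|^2)u_{i^*+1}(x) + \bar{e}_{ro}(x). \tag{63}$$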



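Then, along the solutions of (9), by completing the squares,
we have

$$\begin{aligned}
\dot{V}_{i^*}(x) &\le -Q(x) + \frac{1}{\rho^2(|x|^2)}\left(\Delta + \bar{e}_{ro}(x)\right)^2 \\
&= -\left(Q(x) - \epsilon^2|x|^2\right) - \frac{4\gamma^2(|x|) - \left(\Delta + \bar{e}_{ro}(x)\right)^2}{\rho^2(|x|^2)} \\
&\le -Q_0(x) - 4\,\frac{\gamma^2(|x|) - \max\{\kappa_1^2(|w|),\ \kappa_2^2(|x|),\ \bar{e}_{ro}^2(x)\}}{\rho^2(|x|^2)}
\end{aligned}$$

where $Q_0(x) = Q(x) - \epsilon^2|x|^2$ is a positive definite
function of $x$.

Therefore, under Assumptions 3.3, 3.4 and the gain condition
(26), we have the following implication:

$$\begin{aligned}
& V_{i^*}(x) \ge \bar{\alpha} \circ \gamma^{-1} \circ \kappa_1 \circ \underline{\lambda}^{-1}(W(w)) \\
\Rightarrow{}& |x| \ge \gamma^{-1} \circ \kappa_1 \circ \underline{\lambda}^{-1}(W(w)) \\
\Rightarrow{}& \gamma(|x|) \ge \kappa_1(|w|) \\
\Rightarrow{}& \gamma(|x|) \ge \max\{\kappa_1(|w|),\ \kappa_2(|x|),\ |\bar{e}_{ro}(x)|\} \\
\Rightarrow{}& \dot{V}_{i^*}(x) \le -Q_0(x).
\end{aligned} \tag{64}$$

Also, under Assumption 3.4, we have

$$\begin{aligned}
& W(w) \ge \kappa_3 \circ \underline{\alpha}^{-1}(V_{i^*}(x)) \\
\Rightarrow{}& W(w) \ge \kappa_3(|x|) \\
\Rightarrow{}& \nabla W(w)\Delta_w(w, x) \le -\kappa_4(|w|).
\end{aligned} \tag{65}$$

Finally, under the gain condition (26), it follows that

$$\begin{aligned}
& \gamma(s) > \kappa_1 \circ \underline{\lambda}^{-1} \circ \kappa_3 \circ \underline{\alpha}^{-1} \circ \bar{\alpha}(s) \\
\Rightarrow{}& \gamma \circ \bar{\alpha}^{-1}(s') > \kappa_1 \circ \underline{\lambda}^{-1} \circ \kappa_3 \circ \underline{\alpha}^{-1}(s') \\
\Rightarrow{}& s' > \bar{\alpha} \circ \gamma^{-1} \circ \kappa_1 \circ \underline{\lambda}^{-1} \circ \kappa_3 \circ \underline{\alpha}^{-1}(s')
\end{aligned} \tag{66}$$

where $s' = \bar{\alpha}(s)$. Hence, the following small-gain
condition holds:

$$\left[\bar{\alpha} \circ \gamma^{-1} \circ \kappa_1 \circ \underline{\lambda}^{-1}\right] \circ \left[\kappa_3 \circ \underline{\alpha}^{-1}(s)\right] < s, \quad \forall s > 0. \tag{67}$$

By Theorem 3.1 in [13], the system (8), (9) is globally
asymptotically stable at the origin.

Next, denote
$\chi_1 = \bar{\alpha} \circ \gamma^{-1} \circ \kappa_1 \circ \underline{\lambda}^{-1}$
and $\chi_2 = \kappa_3 \circ \underline{\alpha}^{-1}$.
Also, let $\hat{\chi}_1$ be a function of class
$\mathcal{K}_\infty$ such that

1) $\hat{\chi}_1(s) \le \chi_1^{-1}(s)$, $\forall s \in [0, \lim_{s \to \infty}\chi_1(s))$,

2) $\chi_2(s) \le \hat{\chi}_1(s)$, $\forall s \ge 0$.

Then, as shown in [13], there exists a continuously
differentiable class $\mathcal{K}_\infty$ function $\sigma(s)$
satisfying $\sigma'(s) > 0$ and
$\chi_2(s) < \sigma(s) < \hat{\chi}_1(s)$, $\forall s > 0$, such
that the set

$$\Omega_{i^*} = \{(w, x) : \max[\sigma(V_{i^*}(x)),\ W(w)] < \sigma(d)\} \tag{68}$$

is an estimate of the region of attraction of the closed-loop
system composed of (8), (9), and (23).
The proof is thus complete.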



Proof of Theorem 4.2 

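Define

$$\bar{e}_{ro1}(X_1) = \begin{cases} e_{ro1}(X_1), & U_{i^*}(X_1) \le d_1, \\ 0, & U_{i^*}(X_1) > d_1, \end{cases}
\qquad
\bar{e}_{ro}(x) = \begin{cases} e_{ro}(x), & U_{i^*}(X_1) \le d_1, \\ 0, & U_{i^*}(X_1) > d_1. \end{cases}$$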

Along the solutions of (l29]l-(l3ni with the control policy 

--^{x)pl{\x,nc 



2p2(|a;|2) 



e C-e™i(^i), 



it follows that 



7f(|Xi|)-max{^?(l^l),^i(l^il),e?„(x)} 



7f(|Xi|)-max{«f(|u;|),^i(|Xi|),g^„(x)} 
-ff{\Xi\)~max{~Kj{\w\),kli\Xi\),el,,{Xi)} 

As a result, 

Ui*{Xi) < max{ai07j"-^oKioA~"^(VK(u;)), ai07j"-^(|t)|)} 
^U^'{X^)<-Qo{x) + ^e^\C\' 

The rest of the proof follows the same reasoning as in the 
proof of Theorem 3.2. 